Cortex kv panel improvements #371
Conversation
Distributor uses multiple kv stores - for global limits and ha-tracker, as well as reading from the ingester ring - so we need to narrow the panel to just the one it says it is showing. For consistency, do the same on the ingester panel, although currently ingesters only have one kv store. Note that the renamed recording rule will mean that dashboards show no data for latency prior to the change.
For HA-tracker, show which tenants are changing election. For ingester, show how many are active, leaving, etc.
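As a rough sketch (not the PR's exact code) of the ingester-state panel described in that commit: cortex_ring_members and its name/state labels exist in Cortex, but the panel title and scraping it via the distributor job are assumptions on my part, reusing the $.panel / $.queryPanel / $.jobMatcher helpers from the diff below.

.addPanel(
  $.panel('Ingesters by ring state') +
  $.queryPanel(
    // cortex_ring_members is reported by ring clients (here assumed to be the
    // distributors watching the ingester ring) with a state label: ACTIVE, LEAVING, etc.
    'sum by (state) (cortex_ring_members{name="ingester", %s})' % $.jobMatcher($._config.job_names.distributor),
    '{{state}}'
  )
)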
Created as Draft because the renamed recording rule will mean that dashboards show no data for latency prior to the change. Do we want to cope with this? We could add an extra three queries on the latency panels with the old name, but it will get quite ugly in the code. Discuss.
If it was the read/write latency, I would have said definitely yes. In this case it's the KV latency, so we can probably accept the fact that we'll "lose the history" when looking at this dashboard in the past (or over a large period), so to me it's OK.
.addPanel(
  $.panel('Elected replica changes / min') +
  $.queryPanel([
    'max by(exported_cluster, user)(increase(cortex_ha_tracker_elected_replica_changes_total{%s}[1m])) >0' % $.jobMatcher($._config.job_names.distributor),
This can be a pretty high-cardinality query. For example, I just tested it in a cluster with a few thousand tenants, and running it over "Last 12h" took 25s on cold caches. What if we add a recording rule for that? I would expect the > 0 to narrow down the cardinality a lot (on top of the max by() already squashing it by a factor of the number of distributor replicas).
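For reference, a minimal sketch of what such a recording rule could look like, assuming the mixin's usual record/expr rule objects; the rule name and the set of labels kept are illustrative, not an agreed naming:

{
  // Hypothetical rule: pre-computes the per-tenant elected-replica changes
  // (already filtered with > 0) so the dashboard query stays cheap.
  record: 'cluster_namespace_user:cortex_ha_tracker_elected_replica_changes:increase1m',
  expr: |||
    max by (cluster, namespace, exported_cluster, user) (
      increase(cortex_ha_tracker_elected_replica_changes_total[1m])
    ) > 0
  |||,
},

The panel would then query the recorded series instead of the raw counter.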
What this PR does:
Distributor uses multiple kv stores - for global limits and ha-tracker, as well as reading from the ingester ring - so we need to narrow the panel to just the one it says it is showing, e.g. kv_name="distributor-hatracker". For consistency, do the same on the ingester panel, although currently ingesters only have one kv store.
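As a rough illustration of that narrowing (not the PR's exact code), a distributor KV panel could add the kv_name matcher along these lines; the panel title is made up, and the query is shown against the raw KV client histogram rather than the latency recording rules the dashboards actually use:

.addPanel(
  $.panel('KV store requests / sec (HA tracker)') +
  $.queryPanel(
    // kv_name keeps only the HA-tracker KV client, excluding the ring and
    // limits KV clients the distributor also runs.
    'sum by (operation) (rate(cortex_kv_request_duration_seconds_count{%s, kv_name="distributor-hatracker"}[1m]))' % $.jobMatcher($._config.job_names.distributor),
    '{{operation}}'
  )
)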
Created as Draft because the renamed recording rule will mean that dashboards show no data for latency prior to the change. Do we want to cope with this? We could add an extra three queries on the latency panels with the old name, but it will get quite ugly in the code. Discuss.
Also, I added a couple of panels with more info on what the KV store is doing.
Checklist
CHANGELOG.md updated